Localization: stop the AI translator stripping quotes that are part of a translated value #25721
Merged
jkmassel merged 3 commits into jkmassel/claude-string-translation on Jun 30, 2026
A translation whose value is itself wrapped in quotation marks must keep them; only the model's cosmetic wrapping around a raw single-string reply should be stripped. Cover both structured paths (translate_plural, translate_all).
clean() removes the cosmetic quotes a model wraps around a raw single-string reply. The plural and batch paths ran it on values already decoded by JSON.parse, so a value whose own content is quoted (e.g. "Reader") lost its quotes too. Run clean() only on the raw single-string reply in translate(); the JSON-decoded plural/batch values are whitespace-trimmed but never quote-stripped, since JSON.parse has already removed the structural quotes and anything left is content. Also covers the async collect_batch path (shares validated_batch) and curly-quoted values for free. Satisfies the tests added in d923c25.
…ervation tests Two more regression guards for the clean()-on-decoded-value fix: a curly/smart-quoted value (“Reader”) through translate_all, since clean() strips “ ” as well as straight quotes; and a quoted value through the async collect_batch path, which shares validated_batch with translate_all. Both fail against the pre-fix code, so a narrower fix — only un-stripping straight quotes, or only the sync path — cannot slip past.
jkmassel approved these changes on Jun 30, 2026
Merged as 65bc4df into jkmassel/claude-string-translation
26 checks passed
pull Bot pushed a commit to kliu/WordPress-iOS that referenced this pull request on Jul 1, 2026
* Localization: AI translation primitives

  Reusable, unit-tested Ruby primitives for the AI translation tier of the localization pipeline: the service behind the `human ?? AI ?? English` floor whose AI stub was left open in wordpress-mobile#25688. Pure prompt-building and validation with the Anthropic SDK call injected, so the logic is testable without the gem or the network. Not wired into any lane yet.

  - TranslationValidator: format-specifier safety gate. A translation must preserve the source's placeholders (count and type; positional reordering allowed), or it is rejected and falls back to English.
  - Glossary: brand do-not-translate list plus per-locale terms and register.
  - AITranslator: single-string, per-key plural form-set (one consistent stem across CLDR forms), and batched string translation, with structured-output (output_config) enforcement.
  - AnthropicBatch: Message Batches submit/await/results/collect for bulk backfill.

  50 unit tests, rubocop clean.

* Localization: run the AI translation tooling unit tests in CI

  The pure-Ruby unit suites (TranslationValidator, Glossary, AnthropicBatch, AITranslator) weren't executed by any pipeline step: the "Unit Tests" jobs are the Xcode/XCTest suites, and rubocop (via Danger) only lints them. Add a lightweight Buildkite step that runs each fastlane/lanes/*_test.rb with plain ruby (stdlib minitest; no Xcode, no app build, no bundle). Runs unconditionally rather than behind should-skip-job.sh --job-type validation, which skips on tooling-only changes, i.e. exactly the PRs that touch these files.

* Localization: correct the for_plural docstring

  The previous note advertised for_plural as a one-line swap to wire the live translation tier. That path routes each plural form through single-string translate, so it forfeits the cross-form consistency translate_plural exists to provide (the lemma drift PLURAL_OUTPUT warns about). Relabel for_plural as the per-cell fallback and point the live-tier wiring at translate_plural's form-set seam.

* Localization: stop the AI translator stripping quotes that are part of a translated value (wordpress-mobile#25721)

  * Localization: assert clean() preserves quotes that are part of a value

    A translation whose value is itself wrapped in quotation marks must keep them; only the model's cosmetic wrapping around a raw single-string reply should be stripped. Cover both structured paths (translate_plural, translate_all).

  * Localization: stop clean() stripping quotes that are part of a value

    clean() removes the cosmetic quotes a model wraps around a raw single-string reply. The plural and batch paths ran it on values already decoded by JSON.parse, so a value whose own content is quoted (e.g. "Reader") lost its quotes too. Run clean() only on the raw single-string reply in translate(); the JSON-decoded plural/batch values are whitespace-trimmed but never quote-stripped, since JSON.parse has already removed the structural quotes and anything left is content. Also covers the async collect_batch path (shares validated_batch) and curly-quoted values for free. Satisfies the tests added in d923c25.

  * Localization: cover curly quotes and the batch path in the quote-preservation tests

    Two more regression guards for the clean()-on-decoded-value fix: a curly/smart-quoted value (“Reader”) through translate_all, since clean() strips “ ” as well as straight quotes; and a quoted value through the async collect_batch path, which shares validated_batch with translate_all. Both fail against the pre-fix code, so a narrower fix (only un-stripping straight quotes, or only the sync path) cannot slip past.

  ---------

  Co-authored-by: Jeremy Massel <1123407+jkmassel@users.noreply.github.com>

* Localization: rename validated_batch/validated_forms to select_valid_*

  At call sites the validated_ prefix reads as an adjective ("the batch that's already been validated") when both methods are in fact where batch and plural-set translations run the placeholder gate, returning only the passing subset. select_valid_batch / select_valid_forms make the filtering action plain where they're called. Pure rename of two private helpers; no behavior change.

---------

Co-authored-by: Oguz Kocer <oguzkocer@users.noreply.github.com>
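The placeholder gate described for TranslationValidator (same count and type of format specifiers, positional reordering allowed, English fallback on failure) might look roughly like the sketch below. This is not the code from the commit: the regex, `placeholder_types`, `placeholders_preserved?`, and `translation_or_english` are all hypothetical names, and the specifier pattern is deliberately simplified (no flags, widths, or precision).

```ruby
# Hypothetical sketch of a format-specifier gate; NOT the PR's TranslationValidator.
# Simplified pattern: optional positional index (e.g. 1$), optional length
# modifier, then a conversion character. Flags, widths and precision are ignored.
SPECIFIER = /%(?:\d+\$)?((?:ll|l|h|z)?[@dDiuUxXoOfeEgGcCsS])/

# Returns the sorted list of placeholder types, so that order and positional
# indexes do not matter but count and type do.
def placeholder_types(text)
  text.gsub('%%', '').scan(SPECIFIER).flatten.sort
end

def placeholders_preserved?(source, translation)
  placeholder_types(source) == placeholder_types(translation)
end

# Assumed fallback rule: a translation that fails the gate is rejected and the
# English source string is used instead.
def translation_or_english(source, translation)
  placeholders_preserved?(source, translation) ? translation : source
end

puts translation_or_english('%1$@ liked %2$d posts', 'A %1$@ le gustaron %2$d entradas')
puts translation_or_english('%d items', '%@ elementos') # type changed, falls back to English
```

Comparing sorted type lists is one simple way to allow reordering while still catching a dropped, added, or retyped placeholder; the real validator may apply stricter rules.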
Targeting #25705.
The AI translation tier could drop quotation marks that are part of a translated value.
translate_plural and translate_all ran clean() (which removes the cosmetic quotes a model wraps around a raw reply) on values already decoded by JSON.parse, so a value whose own content is quoted (e.g. "Reader") lost its quotes too.

Fix

Run clean() only on the raw single-string reply in translate(). The JSON-decoded plural/batch values are whitespace-trimmed but never quote-stripped: JSON.parse has already removed the structural quotes, so anything left is content. This also covers the async collect_batch path (it shares validated_batch) and curly “…” quotes.

Test plan

- "…" and curly “…” quotes, across both the sync (translate_plural / translate_all) and async (collect_batch) paths.
- ai_translator_test.rb suite green (34 runs, 0 failures); test_returns_cleaned_translation still pins the single-string cosmetic-quote stripping, so the fix has to be path-specific.
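The path-specific handling described above can be illustrated with a small sketch. It is not the PR's implementation: this `clean` is a hypothetical stand-in for the real method, and `translate_reply` and `decode_batch` are illustrative names for the single-string and JSON-decoded paths respectively.

```ruby
require 'json'

# Hypothetical stand-in for clean(): trims whitespace, then removes ONE pair of
# matching wrapping quotes, straight or curly. The real method may differ.
QUOTE_PAIRS = { '"' => '"', "'" => "'", '“' => '”', '‘' => '’' }.freeze

def clean(text)
  stripped = text.strip
  closing = QUOTE_PAIRS[stripped[0]]
  return stripped unless stripped.length >= 2 && closing && stripped.end_with?(closing)

  stripped[1..-2].strip
end

# Single-string path: the reply is raw model text, so wrapping quotes are
# treated as cosmetic and removed.
def translate_reply(raw_reply)
  clean(raw_reply)
end

# Batch/plural path: values come out of JSON.parse, which has already consumed
# the structural quotes. Only whitespace is trimmed; remaining quotes are content.
def decode_batch(raw_json)
  JSON.parse(raw_json).transform_values(&:strip)
end

puts translate_reply('"Lector"')                                        # prints Lector
puts decode_batch('{"reader.title": " \"Lector\" "}')['reader.title']   # prints "Lector"
```

Running `clean` on the decoded batch value instead of `strip` would reproduce the bug: the quoted value `"Lector"` would come back as `Lector`.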